Abstract. We study a character-based phylogeny reconstruction problem when an incomplete set
of data is given. More specifically, we consider the situation under the directed perfect phylogeny
assumption with binary characters in which for some species the states of some characters are
missing. Our main objective is to give an efficient algorithm to enumerate (or list) all perfect
phylogenies that can be obtained when the missing entries are completed. While a simple
branch-and-bound algorithm (B&B) shows a theoretically good performance, we propose another
approach based on a zero-suppressed binary decision diagram (ZDD). Experimental results on
randomly generated data show that the ZDD approach outperforms B&B. We also prove that counting
the number of phylogenetic trees consistent with given data is #P-complete, thus providing
evidence that efficient random sampling is hard.



1 Introduction 



One of the most important problems in phylogenetics is the reconstruction of phylogenetic trees. In
this paper, we focus on the character-based approach. Namely, each species is described by its
characters, and a mutation corresponds to a change of characters. However, in real-world data not all
states of all characters are observable or reliable, which makes the data incomplete. Thus, we need a
methodology that can cope with such incompleteness.

Following Pe'er et al. [12], we work with the perfect phylogeny assumption, which means that the set
of all nodes with the same character state induces a connected subtree. All characters are binary,
namely, they take only two values. Without loss of generality, assume that these two values are encoded
by 0 and 1. Then, the phylogeny is directed in the sense that for each character a mutation from 0 to 1
is possible only once, but a mutation from 1 to 0 is impossible (this is also called the Camin-Sokal
parsimony [5]). We consider the situation where for some species the states of some characters are
unknown. Under this setting, Pe'er et al. [12] provided a polynomial-time algorithm to reconstruct a
phylogenetic tree that can be obtained when the unknown states are completed, if one exists.

Although their algorithm can find a phylogenetic tree efficiently, it does not take the likelihood into 
account. This motivates people to look at optimization problems; namely we may introduce an objective 
function (or an evaluation function) and try to find a perfect phylogeny that maximizes the value of the 
function. For example, Gusfield et al. 21 looked at such an optimization problem and formulated it as an 
integer linear program. One big issue here is that these optimization problems tend to be NP-hard, and
thus we cannot expect to obtain polynomial-time algorithms. Therefore, we need some compromise. If
we insist on efficiency, then we need to sacrifice the quality of the obtained solution. This approach
leads us to approximation algorithms. If we insist on optimality, then we need to sacrifice the running
time. This approach leads us to exponential-time exact algorithms. However, the techniques used with
these approaches in the literature, such as those of Gusfield et al. 0], exploit the specific form of
the objective function.

1.1 Our Results 

The focus of this paper is the exact approach. However, unlike the previous work, we aim at
enumeration algorithms, which give a more flexible framework for scientific discovery independent of
the form of objective functions. The use of enumeration algorithms is highlighted in data mining and
artificial intelligence. For example, the apriori algorithm by Agrawal and Srikant [1] enumerates all
frequent itemsets in a transaction database. Enumeration algorithms are not expected to run faster
than non-enumeration algorithms. Therefore, the goal of this paper is to examine the possibilities and
the limitations of enumerative approaches.

One of the difficulties in designing efficient enumeration algorithms is to avoid duplication. Suppose 
that we are to output an object, and need to check if this object was already output or not. If we store 
all objects that we output so far, then we can check it by going through them. However, storing them 
may take too much space, and going through them may take too much time. The number of objects
is typically exponentially large. Our algorithm cleverly avoids such checks, but still ensures exhaustive 
enumeration without duplication. 

It is rather straightforward to give an algorithm with a theoretical guarantee such as polynomiality.
Namely, a simple branch-and-bound idea gives an algorithm whose running time is polynomial in the
input size and linear in the output size. Notice that an enumeration algorithm outputs all the objects,
and thus its running time is at least proportional to the number of output objects. Thus, the
linearity in the output size cannot be avoided in any enumeration algorithm.

However, such a theoretically-guaranteed algorithm does not necessarily run fast in practice. Thus, 
we propose another algorithm that is based on a zero-suppressed binary decision diagram (ZDD). A ZDD 
was introduced by Minato [11]. It is a directed graph that has a similar structure to a binary decision
diagram (BDD). While a BDD is used to represent a boolean function in a compressed way, a ZDD only
represents the satisfying assignments of the function in a compressed way (a formal definition will be
given in Section 3). Furthermore, we may employ a lot of operations on ZDDs, called the family algebra,
which can be used for efficient filtering and optimization with respect to some objective functions.
Knuth's book [10] devotes one section to ZDDs, and gives numerous applications as exercises.

Although the size of a constructed ZDD is bounded by a polynomial of the number of output objects, 
we cannot guarantee that the size of a ZDD that is created at the intermediate steps in the course of 
our algorithm is bounded. This means that we cannot guarantee a polynomial running time (in
the input size and the output size) for our ZDD algorithm. However, the crux here is that the size of a 
constructed ZDD can be much smaller than the number of output objects. We exhibit this phenomenon 
in two ways. First, we give an example in which the number of phylogenetic trees is exponential in the 
input size, but the size of the constructed ZDD is polynomial in the input size. Second, we perform 
experiments on randomly generated data, and the result shows that our ZDD algorithm can solve more 
instances than a branch-and-bound algorithm. This suggests that the ZDD approach is quite promising. 

Having enumeration algorithms, we can also count the number of phylogenetic trees. In particular, 
the branch-and-bound algorithm can count them in polynomial time in the input size and the output 
size. This naturally raises the following question: Is it possible to count them in polynomial time only 
in the input size? Note that since we only compute the number, we do not have to output each object 
one by one, and thus the linearity of the running time in the output size could be avoided. Such a 
polynomial-time counting algorithm could be combined with a branch-and-bound enumeration algorithm 
to design a random sampling algorithm. Namely, when we branch, we count the number of outputs in 
each subinstance in polynomial time, and choose one subinstance at random according to the computed 
numbers. For more on the connection of counting and sampling, we refer to a book by Sinclair [13].
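The sampling scheme described above can be sketched as follows. This is only an illustration under stated assumptions: `count_phylogenies` is a brute-force stand-in for the hypothetical polynomial-time counter (the names `is_laminar`, `count_phylogenies` and `sample_phylogeny` are ours, not the paper's), and the branching decides, for one undetermined pair at a time, whether an element belongs to the set or not.

```python
import random
from itertools import combinations, product

def is_laminar(sets):
    """True if every pair of sets is nested or disjoint."""
    for a, b in combinations(sets, 2):
        common = a & b
        if common and common != a and common != b:
            return False
    return True

def count_phylogenies(L, U):
    """Brute-force counter; a stand-in for the hypothetical fast counter."""
    options = []
    for lo, hi in zip(L, U):
        free = sorted(hi - lo)
        options.append([frozenset(lo) | frozenset(c)
                        for r in range(len(free) + 1)
                        for c in combinations(free, r)])
    return sum(1 for choice in product(*options) if is_laminar(choice))

def sample_phylogeny(L, U, rng):
    """Return one solution uniformly at random, or None if there is none."""
    L = [set(l) for l in L]
    U = [set(u) for u in U]
    if count_phylogenies(L, U) == 0:
        return None
    for i in range(len(L)):
        for e in sorted(U[i] - L[i]):
            L[i].add(e)                      # subinstance "e is in S_i"
            with_e = count_phylogenies(L, U)
            L[i].discard(e)
            U[i].discard(e)                  # subinstance "e is not in S_i"
            without_e = count_phylogenies(L, U)
            # choose a branch with probability proportional to its count
            if rng.random() * (with_e + without_e) < with_e:
                L[i].add(e)
                U[i].add(e)
    return tuple(frozenset(s) for s in L)
```

Each leaf is reached with probability equal to the product of the branch ratios, which telescopes to one over the total count, so the output is uniform whenever the counter is exact.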

We prove that this is unlikely. Namely, counting the number of phylogenetic trees for the
incomplete directed binary perfect phylogeny is #P-complete. The complexity class #P contains all
counting problems in which a counted object has a polynomial-time verifiable certificate. Since no
#P-complete problem is known to be solvable in polynomial time, the #P-completeness suggests that
the problem is unlikely to be solvable in polynomial time.

1.2 Graph Sandwich 

Pe'er et al. [12] rephrased the incomplete directed binary perfect phylogeny problem as a bipartite graph
sandwich problem. The graph sandwich problem, in general, was introduced by Golumbic et al. [3]. In
the graph sandwich problem, we fix a class $\mathcal{C}$ of graphs, and we are given two graphs
$G_1 = (V, E_1)$, $G_2 = (V, E_2)$ such that $E_1 \subseteq E_2$. Then, we are asked to find a graph
$G = (V, E) \in \mathcal{C}$ such that $E_1 \subseteq E \subseteq E_2$.
Golumbic et al. [3] proved that even for some restricted classes of graphs, the problem is NP-complete.



The subsequent results by various researchers also show that for a lot of cases the problem is
NP-complete, even though the recognition problem for those classes can be solved in polynomial time (we
will not include here a long list of literature). Thus, the result by Pe'er et al. [12] gives a rare
example for which the graph sandwich problem can be solved in polynomial time.

Recently, the graph sandwich enumeration problem has been studied. Kijima et al. [5] studied the 
graph sandwich enumeration problem for chordal graphs. They provided efficient algorithms when $G_1$
or $G_2$ is chordal, where "efficient" means that it runs in polynomial time in the input size and linear
time in the output size. Their approach was generalized by Heggernes et al. [5] to all sandwich-monotone 
graph classes. In this respect, this paper gives another example of efficient graph sandwich enumeration 
algorithms. 

1.3 Organization 

In Section 2 we introduce the problem more formally. In Section 3 we provide the algorithm based
on ZDDs, and give an example in which the compression really works. In Section 4 we prove that the
counting version is intractable. Section 5 gives experimental results. We conclude in the final section.

2 Preliminaries 

Due to the pairwise compatibility lemma (see, e.g., [7]), we may define our problem in terms of laminars.
We adopt this view throughout the paper.

A sequence $\mathcal{S} = (S_1, \dots, S_m)$ of subsets of a finite set $S$ is a laminar if for every
two $i, j \in \{1, \dots, m\}$ the intersection $S_i \cap S_j$ is either $S_i$, $S_j$, or
$\emptyset$.$^1$ In the incomplete directed binary perfect phylogeny problem (IDBPP), we are given two
sequences $\mathcal{L} = (L_1, \dots, L_m)$, $\mathcal{U} = (U_1, \dots, U_m)$ of $m$ subsets of $S$
such that $L_i \subseteq U_i \subseteq S$ for all $i \in \{1, \dots, m\}$, and the question is to
determine whether there exists a laminar $\mathcal{S} = (S_1, \dots, S_m)$ such that
$L_i \subseteq S_i \subseteq U_i$ for all $i \in \{1, \dots, m\}$. We call such a laminar a directed
binary perfect phylogeny for $(S, \mathcal{L}, \mathcal{U})$. The IDBPP can be solved in polynomial
time [12].
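The two definitions above translate directly into a checker. The following is a minimal sketch, assuming sets are given as Python `frozenset`s; the function names `is_laminar` and `is_phylogeny` are ours and do not come from the paper.

```python
from itertools import combinations

def is_laminar(sets):
    """True if every pair of sets is nested or disjoint."""
    for a, b in combinations(sets, 2):
        common = a & b
        if common and common != a and common != b:
            return False
    return True

def is_phylogeny(S, L, U):
    """True if S is a laminar with L[i] <= S[i] <= U[i] for every index i."""
    if not (len(S) == len(L) == len(U)):
        return False
    sandwiched = all(l <= s <= u for s, l, u in zip(S, L, U))
    return sandwiched and is_laminar(S)

# A toy instance with two characters over the species {1, 2, 3}.
L = [frozenset({1}), frozenset({2})]
U = [frozenset({1, 2, 3}), frozenset({2, 3})]
```

For this toy instance, the candidate `({1, 2, 3}, {2, 3})` is accepted because the sets are nested, while `({1, 3}, {2, 3})` is rejected because the two sets overlap in species 3 without being nested.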

Let us briefly describe the correspondence to phylogenetic trees. The set $S$ represents the set of
species, and the indices $1, \dots, m$ represent the characters. Then, $S_i$ represents the set of
species that have the character $i$. The species in $L_i$ are those known to have the character $i$,
and the species in $S \setminus U_i$ are those known not to have $i$.

In this paper, we consider the following variants that take the same input as the IDBPP. In the
counting version of IDBPP, the objective is to output the number of directed binary perfect phylogenies.
In the enumeration version of IDBPP, the objective is to output all the directed binary perfect
phylogenies. Note that enumeration should be exhaustive, and should not output the same object more
than once.
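For very small inputs, both variants can be made concrete by exhaustive search. The sketch below is not one of the algorithms studied in this paper; it merely fixes the expected output on toy instances, and all function names are ours.

```python
from itertools import combinations, product

def is_laminar(sets):
    """True if every pair of sets is nested or disjoint."""
    for a, b in combinations(sets, 2):
        common = a & b
        if common and common != a and common != b:
            return False
    return True

def subsets_between(lo, hi):
    """Yield every set T with lo <= T <= hi."""
    free = sorted(hi - lo)
    for r in range(len(free) + 1):
        for extra in combinations(free, r):
            yield frozenset(lo) | frozenset(extra)

def enumerate_phylogenies(L, U):
    """Yield each directed binary perfect phylogeny exactly once."""
    ranges = [list(subsets_between(l, u)) for l, u in zip(L, U)]
    for choice in product(*ranges):
        if is_laminar(choice):
            yield choice

def count_phylogenies(L, U):
    return sum(1 for _ in enumerate_phylogenies(L, U))
```

Because every candidate tuple is generated once, no duplicate check is needed here; the price is a running time exponential in the number of undetermined entries.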

3 ZDD Approach 

3.1 Introduction to ZDDs 

Let $f\colon \{0,1\}^N \to \{0,1\}$ be an $N$-variate boolean function with boolean variables
$x_1, \dots, x_N$. We assume a linear order on the variables $\{x_1, \dots, x_N\}$ in which $x_i$
precedes $x_j$ if and only if $i < j$. A binary decision diagram (BDD) for $f$, denoted by $B(f)$, is a
vertex-labeled directed graph with the following properties.

- There is only one vertex with indegree 0, called the root of $B(f)$.

- There are only two vertices with outdegree 0, called the terminals of $B(f)$.

- Each vertex of $B(f)$, except for the terminals, is labeled by a variable from $\{x_1, \dots, x_N\}$.

- One terminal is labeled by 0 (called the 0-terminal), and the other terminal is labeled by 1 (called
the 1-terminal).

- Each edge of $B(f)$ is labeled by 0 or 1. An edge labeled by 0 is called a 0-edge, and an edge labeled
by 1 is called a 1-edge.



$^1$ Usually, a laminar is defined as a family of subsets, but for our purpose it is convenient to define
it as a sequence of subsets.




- Each vertex of $B(f)$, except for the terminals, has exactly one outgoing 0-edge and exactly one
outgoing 1-edge.

- If there is a path from a vertex $u$ to a non-terminal vertex $v$ in $B(f)$, then the label of $v$ is
smaller than the label of $u$.

- A boolean assignment $a\colon \{x_1, \dots, x_N\} \to \{0,1\}$ satisfies $f$ (i.e.,
$f(a(x_1), \dots, a(x_N)) = 1$) if and only if there exists a path $P$ from the root to the 1-terminal
in $B(f)$ that satisfies the following condition: $a(x_i) = 1$ if and only if there exists a vertex $v$
on $P$ labeled by $x_i$ such that $P$ traverses the 1-edge leaving $v$.

A BDD for a function $f$ is not unique, and may contain redundant information. However, the following
reduction rules turn a BDD into a smaller equivalent BDD. A zero-suppressed binary decision diagram
(ZDD) for a function $f$ is a BDD, denoted by $Z(f)$, for which the reduction rules cannot be applied.

1. If the outgoing 1-edge of a vertex $v$ points to the 0-terminal and the outgoing 0-edge of $v$ points
to a vertex $u$, then we remove $v$ and its outgoing edges, and reconnect the incoming edges of $v$ to
the vertex $u$.

2. If two vertices $v, v'$ have the same label $x_i$, their outgoing 1-edges point to the same vertex
$u_1$, and their outgoing 0-edges point to the same vertex $u_0$, then replace $v, v'$ with a single
vertex $w$ with label $x_i$. The incoming edges of $w$ are those of $v, v'$, the outgoing 1-edge from
$w$ points to $u_1$, and the outgoing 0-edge from $w$ points to $u_0$.

Fig. 1. A ZDD for a function $f(x_1, x_2, x_3)$.

Fig. 1 shows an example of a ZDD. The edges are assumed to be directed downward. A dashed line
represents a 0-edge, and a solid line represents a 1-edge.

The size of a ZDD $Z(f)$ is defined as the number of vertices, and denoted by $|Z(f)|$. It is easy to
observe that the size of a ZDD $Z(f)$ is $O(NA)$, where $A$ is the number of satisfying assignments of
$f$. However, this is merely an upper bound, and in practice the size can be much smaller. Thus, a ZDD
for $f$ gives a compressed representation of the family of all satisfying assignments of $f$. In
particular, if we have a family $\mathcal{F}$ of subsets of a finite set $S$ and consider a boolean
function $f\colon \{0,1\}^S \to \{0,1\}$ such that $f(x) = 1$ if and only if
$\{e \in S \mid x_e = 1\} \in \mathcal{F}$, then a ZDD for $f$ compactly encodes the family
$\mathcal{F}$.

There is a family of operations that can be performed on ZDDs. Here, we list those which we use in our
algorithm. Let $f, f'\colon \{0,1\}^N \to \{0,1\}$ be boolean functions with variables
$x_1, \dots, x_N$, and let ZDDs $Z(f), Z(f')$ be given. Then, a ZDD $Z(f \vee f')$ of the disjunction
(logical OR) can be obtained in $O(|Z(f)||Z(f')|)$ time. Let
$f^{[x_i=0]}\colon \{0,1\}^{N-1} \to \{0,1\}$ be a boolean function with variables
$x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_N$ obtained from $f$ by
$f^{[x_i=0]}(x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_N) = f(x_1, \dots, x_{i-1}, 0, x_{i+1}, \dots, x_N)$.
Then, a ZDD $Z(f^{[x_i=0]})$ can be found in $O(|Z(f)|)$ time. Similarly, we may define $f^{[x_i=1]}$,
and a ZDD $Z(f^{[x_i=1]})$ can be found in $O(|Z(f)|)$ time.
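To make the two reduction rules and the operations above concrete, here is a small self-contained sketch of a ZDD manager. It is not the library used in our experiments; the class and method names are ours, variables are integers, and we assume (for this sketch only) that a smaller variable index sits closer to the root. The restriction routine is written without memoization for brevity, so it does not attain the linear time bound stated above.

```python
from itertools import combinations

class ZDD:
    """Tiny ZDD manager; node ids 0 and 1 are the two terminals."""

    def __init__(self):
        self.nodes = {}    # node id -> (var, lo, hi)
        self.unique = {}   # (var, lo, hi) -> node id
        self.cache = {}    # memo table for union

    def node(self, var, lo, hi):
        if hi == 0:                    # rule 1: 1-edge goes to the 0-terminal
            return lo
        key = (var, lo, hi)
        if key not in self.unique:     # rule 2: identical vertices are shared
            nid = len(self.nodes) + 2
            self.nodes[nid] = key
            self.unique[key] = nid
        return self.unique[key]

    def var(self, n):
        return self.nodes[n][0] if n > 1 else float("inf")

    def from_set(self, s):
        """ZDD of the family that contains the single set s."""
        n = 1
        for v in sorted(s, reverse=True):
            n = self.node(v, 0, n)
        return n

    def union(self, a, b):
        """Disjunction of two functions, i.e. union of the two families."""
        if a == 0:
            return b
        if b == 0 or a == b:
            return a
        if a > b:
            a, b = b, a
        if (a, b) in self.cache:
            return self.cache[(a, b)]
        va, vb = self.var(a), self.var(b)
        if va < vb:
            _, lo, hi = self.nodes[a]
            r = self.node(va, self.union(lo, b), hi)
        elif vb < va:
            _, lo, hi = self.nodes[b]
            r = self.node(vb, self.union(a, lo), hi)
        else:
            _, alo, ahi = self.nodes[a]
            _, blo, bhi = self.nodes[b]
            r = self.node(va, self.union(alo, blo), self.union(ahi, bhi))
        self.cache[(a, b)] = r
        return r

    def restrict(self, n, var, value):
        """Fix variable var to value (0 or 1) and drop it from the sets."""
        if n <= 1 or self.var(n) > var:   # var does not occur below n
            return n if value == 0 else 0
        v, lo, hi = self.nodes[n]
        if v == var:
            return lo if value == 0 else hi
        return self.node(v, self.restrict(lo, var, value),
                         self.restrict(hi, var, value))

    def count(self, n):
        """Number of sets in the family (satisfying assignments)."""
        if n <= 1:
            return n
        _, lo, hi = self.nodes[n]
        return self.count(lo) + self.count(hi)

    def size(self, n):
        """Number of non-terminal vertices reachable from n."""
        seen, stack = set(), [n]
        while stack:
            x = stack.pop()
            if x > 1 and x not in seen:
                seen.add(x)
                stack.extend(self.nodes[x][1:])
        return len(seen)

def power_set_zdd(z, k):
    """Build the family of all subsets of {1, ..., k} by repeated union."""
    root = 0
    for r in range(k + 1):
        for s in combinations(range(1, k + 1), r):
            root = z.union(root, z.from_set(s))
    return root
```

The helper `power_set_zdd` illustrates the gap between the number of represented sets and the number of vertices: the family of all subsets of a k-element set has 2^k members, but its ZDD is a chain of k vertices.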

3.2 ZDD-Based Enumeration Algorithm 

We introduce a boolean variable $x_{i,e}$ for each pair $(i, e)$ of an index $i \in \{1, \dots, m\}$ and
an element $e \in S$. Then, we consider the conjunction (logical AND) of the following conditions, which
gives rise to a boolean function $f\colon \{0,1\}^{\{1,\dots,m\} \times S} \to \{0,1\}$.

1. For every $i \in \{1, \dots, m\}$, if $e \in L_i$, then $x_{i,e} = 1$.

2. For every $i \in \{1, \dots, m\}$, if $e \in S \setminus U_i$, then $x_{i,e} = 0$.

3. For every distinct $i, j \in \{1, \dots, m\}$, at least one of the following three is satisfied.

(a) For all $e \in S$, if $x_{i,e} = 1$, then $x_{j,e} = 1$.

(b) For all $e \in S$, if $x_{i,e} = 0$, then $x_{j,e} = 0$.

(c) For all $e \in S$, if $x_{i,e} = 1$, then $x_{j,e} = 0$.

We can easily see that, for an assignment satisfying $f$, if we set $S_i = \{e \in S \mid x_{i,e} = 1\}$
for every $i \in \{1, \dots, m\}$, then $\mathcal{S} = (S_1, \dots, S_m)$ is a directed binary perfect
phylogeny for $(S, \mathcal{L}, \mathcal{U})$. Namely, condition 1 translates to $L_i \subseteq S_i$;
condition 2 translates to $S_i \subseteq U_i$; condition 3(a) translates to $S_i \cap S_j = S_i$;
condition 3(b) translates to $S_i \cap S_j = S_j$; condition 3(c) translates to
$S_i \cap S_j = \emptyset$.
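The conditions can be read as a predicate on assignments. The sketch below evaluates it directly, with an assignment given as a dictionary keyed by pairs `(i, e)`; the names `f` and `count_satisfying` are ours, indices start at 0, and the exhaustive count is meant only for toy instances.

```python
from itertools import combinations, product

def f(x, L, U, S):
    """Return 1 if the assignment x meets conditions 1 to 3, else 0."""
    m = len(L)
    for i in range(m):
        if any(x[i, e] != 1 for e in L[i]):        # condition 1
            return 0
        if any(x[i, e] != 0 for e in S - U[i]):    # condition 2
            return 0
    for i, j in combinations(range(m), 2):         # condition 3
        a = all(x[j, e] == 1 for e in S if x[i, e] == 1)
        b = all(x[j, e] == 0 for e in S if x[i, e] == 0)
        c = all(x[j, e] == 0 for e in S if x[i, e] == 1)
        if not (a or b or c):
            return 0
    return 1

def count_satisfying(L, U, S):
    """Count the satisfying assignments of f by trying all of them."""
    keys = [(i, e) for i in range(len(L)) for e in sorted(S)]
    total = 0
    for bits in product((0, 1), repeat=len(keys)):
        total += f(dict(zip(keys, bits)), L, U, S)
    return total
```

On the instance with `S = {1, 2, 3}`, `L = ({1}, {2})` and `U = ({1, 2, 3}, {2, 3})`, exactly 6 of the 64 assignments satisfy `f`, one for each laminar completion.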
These conditions naturally induce the following algorithm. 



Algorithm: ZDD($S, \mathcal{L}, \mathcal{U}$)

Precondition: $S$ is a finite set, $\mathcal{L} = (L_1, \dots, L_m)$, $\mathcal{U} = (U_1, \dots, U_m)$,
each member of $\mathcal{L}$ and $\mathcal{U}$ is a subset of $S$, and $L_i \subseteq U_i$ for every
$i \in \{1, \dots, m\}$.
Postcondition: Output a ZDD $Z(f)$ for the boolean function $f$ over the variables
$\{x_{i,e} \mid i \in \{1, \dots, m\}, e \in S\}$ defined above, which encodes all the directed binary
perfect phylogenies for $(S, \mathcal{L}, \mathcal{U})$.
Step 0: Let $g = 1$ be the constant-one function. Construct a ZDD $Z(g)$.
Step 1: For each $i \in \{1, \dots, m\}$ and each $e \in S$, if $e \in L_i$, then construct
$Z(g^{[x_{i,e}=1]})$ from $Z(g)$ and reset $g := g^{[x_{i,e}=1]}$.
Step 2: For each $i \in \{1, \dots, m\}$ and each $e \in S$, if $e \in S \setminus U_i$, then construct
$Z(g^{[x_{i,e}=0]})$ from $Z(g)$ and reset $g := g^{[x_{i,e}=0]}$.
Step 3: For each distinct $i, j \in \{1, \dots, m\}$ and each $e \in S$, we perform the following.
Step 3-a: Let $g_1 := g^{[x_{i,e}=1, x_{j,e}=1]} \vee g^{[x_{i,e}=0]}$. Construct $Z(g_1)$ from $Z(g)$.
Step 3-b: Let $g_2 := g^{[x_{i,e}=0, x_{j,e}=0]} \vee g^{[x_{i,e}=1]}$. Construct $Z(g_2)$ from $Z(g)$.
Step 3-c: Let $g_3 := g^{[x_{i,e}=1, x_{j,e}=0]} \vee g^{[x_{i,e}=0]}$. Construct $Z(g_3)$ from $Z(g)$.
Step 3-d: Construct $Z(g_1 \vee g_2 \vee g_3)$ from $Z(g_1)$, $Z(g_2)$, $Z(g_3)$, and reset
$g := g_1 \vee g_2 \vee g_3$.
Step 4: Output $Z(g)$ and halt.

Although the output size $|Z(f)|$ is bounded by $O(mnh)$, where $n = |S|$ and $h$ is the number of directed
binary perfect phylogenies for $(S, \mathcal{L}, \mathcal{U})$, we cannot guarantee that ZDDs that appear in the course of
execution have such a bounded size. Thus, the algorithm could be quite slow or could stop due to memory 
shortage. 

3.3 Example with Huge Compression 

We exhibit an example for which the size of a ZDD is exponentially smaller than the number of directed 
binary perfect phylogenies. While the example is artificial, this indicates a possibility that our ZDD-based 
algorithm outperforms the branch-and-bound algorithm. 

Consider the following example. Let $S = \{(i, j) \mid i \in \{1, \dots, n\}, j \in \{0, 1, \dots, k\}\}$.
Then $|S| = (k+1)n$. For each $i \in \{1, \dots, n\}$, let $L_i = \{(i, 0)\}$ and
$U_i = \{(i, 0), (i, 1), \dots, (i, k)\}$. As before, let $\mathcal{L} = (L_1, \dots, L_n)$ and
$\mathcal{U} = (U_1, \dots, U_n)$.

Proposition 1. The number of directed binary perfect phylogenies for $(S, \mathcal{L}, \mathcal{U})$ is
$2^{kn}$.

Proof. For two distinct $i, j \in \{1, \dots, n\}$, it holds that $U_i \cap U_j = \emptyset$. Therefore,
for any sets $S_i, S_j$ with $L_i \subseteq S_i \subseteq U_i$ and $L_j \subseteq S_j \subseteq U_j$, it
holds that $S_i \cap S_j = \emptyset$. This means that a directed binary perfect phylogeny for
$(S, \mathcal{L}, \mathcal{U})$ can be formed by choosing an arbitrary subset of $U_i \setminus L_i$ and
adding it to $L_i$ for each $i \in \{1, \dots, n\}$. Since $|U_i \setminus L_i| = k$, the number of
subsets of $U_i \setminus L_i$ is $2^k$, and thus the number of directed binary perfect phylogenies is
$(2^k)^n = 2^{kn}$. □
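Proposition 1 can be checked numerically for small parameters. The sketch below builds the instance and counts its laminar completions by brute force; the function names are ours and the check is only feasible when $kn$ is small.

```python
from itertools import combinations, product

def is_laminar(sets):
    """True if every pair of sets is nested or disjoint."""
    for a, b in combinations(sets, 2):
        common = a & b
        if common and common != a and common != b:
            return False
    return True

def build_instance(n, k):
    """L_i = {(i, 0)} and U_i = {(i, 0), ..., (i, k)} for i = 1, ..., n."""
    L = [frozenset({(i, 0)}) for i in range(1, n + 1)]
    U = [frozenset((i, j) for j in range(k + 1)) for i in range(1, n + 1)]
    return L, U

def count_phylogenies(L, U):
    """Count laminar completions by exhaustive search."""
    options = []
    for lo, hi in zip(L, U):
        free = sorted(hi - lo)
        options.append([lo | frozenset(c)
                        for r in range(len(free) + 1)
                        for c in combinations(free, r)])
    return sum(1 for choice in product(*options) if is_laminar(choice))
```

For instance, `count_phylogenies(*build_instance(2, 3))` should agree with `2 ** (3 * 2)`.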

Proposition 2. The size of a ZDD constructed by ZDD($S, \mathcal{L}, \mathcal{U}$) is $O(kn)$.

Proof. Fig. 2 shows a constructed ZDD. Note that the ordering of the variables is not relevant. No
matter which ordering we impose on the variables, we obtain an isomorphic ZDD. □



4 Hardness of Counting 

As we explained in the introduction, an efficient counting algorithm can be used for efficient sampling
of combinatorial objects. In this section, we prove that it is unlikely that such an algorithm exists
for the IDBPP by showing that the counting version is #P-complete.

Theorem 1. The counting version of the IDBPP is #P-complete.






Fig. 2. An example for which the number of directed binary perfect phylogenies is exponential, but the size of a 
ZDD is linear. 



Proof. We reduce from the problem of counting the number of matchings in a (simple) bipartite graph,
which is known to be #P-complete [T3].

Let $G = (V, E)$ be a (simple) bipartite graph with a bipartition $V = A \cup B$ of the vertex set. For
each vertex $v \in V$, we set up an element $s_v$, and let $S = \{s_v \mid v \in V\}$. Then, for each
edge $e = \{a, b\} \in E$, where $a \in A$ and $b \in B$, let $L_e = \{s_a\}$ and $U_e = \{s_a, s_b\}$.
Then, we set up $\mathcal{L} = (L_e \mid e \in E)$ and $\mathcal{U} = (U_e \mid e \in E)$. Note that for
each $e \in E$, it holds that $L_e \subseteq U_e$. Thus, $S$, $\mathcal{L}$, and $\mathcal{U}$ form an
instance of the IDBPP.

Let $\mathcal{S} = (S_e \mid e \in E)$ be a directed binary perfect phylogeny for
$(S, \mathcal{L}, \mathcal{U})$. Then, $S_e$ is either $L_e$ or $U_e$ for every $e \in E$, since
$|L_e| = 1$, $|U_e| = 2$, and $L_e \subseteq S_e \subseteq U_e$.

Claim 1. Let $\mathcal{S} = (S_e \mid e \in E)$ be a directed binary perfect phylogeny for
$(S, \mathcal{L}, \mathcal{U})$. Then, the set $M = \{e \in E \mid S_e = U_e\}$ is a matching of $G$.

Proof (of Claim 1). Suppose not. Then, there exist two distinct edges $e, e' \in M$ that share an
endpoint, say $v$. This means that $s_v \in S_e \cap S_{e'}$. Since $\mathcal{S}$ is a laminar on $S$,
it must hold that $S_e \subseteq S_{e'}$ or $S_{e'} \subseteq S_e$. Since $|S_e| = 2 = |S_{e'}|$, it
follows that $S_e = S_{e'}$. Then, $e = e'$ since $G$ is a simple graph. This contradicts the assumption
that $e$ and $e'$ are distinct edges. □

The following claim shows the converse.

Claim 2. Let $M \subseteq E$ be a matching of $G$. Then, the following
$\mathcal{S} = (S_e \mid e \in E)$ is a directed binary perfect phylogeny for
$(S, \mathcal{L}, \mathcal{U})$: $S_e = U_e$ if $e \in M$, and $S_e = L_e$ otherwise.

Proof (of Claim 2). It suffices to prove that the constructed sequence $\mathcal{S}$ is a laminar.
Consider two sets $S_e, S_{e'}$ for two distinct $e, e' \in E$. We have three cases. Let $e = \{a, b\}$
and $e' = \{a', b'\}$, where $a, a' \in A$ and $b, b' \in B$.

1. Assume that $e \in M$ and $e' \in M$. Then, $\{a, b\} \cap \{a', b'\} = \emptyset$, and therefore
$S_e \cap S_{e'} = \emptyset$.

2. Assume that $e \in M$ and $e' \notin M$. If $a \neq a'$, then $S_e \cap S_{e'} = \emptyset$. If
$a = a'$, then $S_{e'} = \{s_{a'}\} \subseteq \{s_a, s_b\} = S_e$. Therefore, $S_e \cap S_{e'} = S_{e'}$.

3. Assume that $e \notin M$ and $e' \notin M$. If $a \neq a'$, then $S_e \cap S_{e'} = \emptyset$. If
$a = a'$, then $S_e = S_{e'}$. □

The maps in the two claims are inverse to each other. Hence the number of matchings in $G$ is equal to
the number of directed binary perfect phylogenies for $(S, \mathcal{L}, \mathcal{U})$. Note that the
reduction runs in polynomial time. □
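The reduction can be verified on small graphs by counting both sides exhaustively. In the sketch below, an edge is a pair `(a, b)` with `a` on the $A$ side and `b` on the $B$ side, and vertex names on the two sides are assumed to be distinct; the function names are ours.

```python
from itertools import combinations, product

def is_laminar(sets):
    """True if every pair of sets is nested or disjoint."""
    for a, b in combinations(sets, 2):
        common = a & b
        if common and common != a and common != b:
            return False
    return True

def reduction(edges):
    """Edge (a, b) yields L_e = {a} and U_e = {a, b}."""
    L = [frozenset({a}) for a, b in edges]
    U = [frozenset({a, b}) for a, b in edges]
    return L, U

def count_phylogenies(L, U):
    """Each S_e is L_e or U_e, so it suffices to try both per edge."""
    return sum(1 for choice in product(*zip(L, U)) if is_laminar(choice))

def count_matchings(edges):
    """Count edge subsets in which no two edges share an endpoint."""
    total = 0
    for r in range(len(edges) + 1):
        for M in combinations(edges, r):
            ends = [v for e in M for v in e]
            if len(ends) == len(set(ends)):
                total += 1
    return total
```

On the path with edges `a1-b1`, `a1-b2`, `a2-b2`, both counts equal 5: the empty matching, three single edges, and the pair `{a1-b1, a2-b2}`.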

5 Experiments 

5.1 Data 

We have used the program ms by Hudson [5] to generate a random data set without incompleteness that
admits a directed binary perfect phylogeny $\mathcal{S} = (S_1, \dots, S_m)$. Then, we have constructed
$L_i$ from $S_i$ by removing each element of $S_i$ independently with probability $p$, and constructed
$U_i$ from $S_i$ by adding each element of $S \setminus S_i$ independently with probability $p$.
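The perturbation step can be sketched as follows. The complete phylogeny is taken as given (the call to ms is not reproduced), and the function name `make_incomplete` and its parameters are ours.

```python
import random

def make_incomplete(S_sets, universe, p, rng):
    """Derive (L, U) from complete data S_sets over the species set universe."""
    L, U = [], []
    for s in S_sets:
        # keep each element of S_i with probability 1 - p
        kept = {e for e in sorted(s) if rng.random() >= p}
        # add each element outside S_i with probability p
        added = {e for e in sorted(universe - s) if rng.random() < p}
        L.append(frozenset(kept))
        U.append(frozenset(s | added))
    return L, U

# Example: three characters over six species, with p = 0.3.
universe = set(range(6))
S_sets = [{0, 1, 2}, {1, 2}, {4}]
L, U = make_incomplete(S_sets, universe, 0.3, random.Random(1))
```

By construction $L_i \subseteq S_i \subseteq U_i$ always holds, so the generated instance is guaranteed to admit at least one directed binary perfect phylogeny, namely the original one.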

We have created 100 instances independently at random for each triple of values
$(m, n, p) \in \{50, 100\} \times \{50, 100\} \times \{0.1, 0.2, 0.3, 0.4, 0.5\}$.

5.2 Implementation and Experiment Environment 

We have implemented the algorithm ZDD described in Section 3 and another algorithm based on the
branch-and-bound idea, which we call B&B. The details of B&B are deferred to Appendix A. We have
implemented both algorithms in C++. For the implementation of B&B, Step 1 uses the deterministic
version of Algorithm A in the paper by Pe'er et al. [12, p. 598], but we have simplified it to gain
practical performance. For example, a set is represented by an integer in such a way that each element
of the set corresponds to a bit in the integer. For $(n, m) = (50, 50)$ we used a 64-bit unsigned long,
and for other cases we used two unsigned longs. This enables us to perform each set-theoretic operation
efficiently by one or two bit operations. Further, we only count the number of directed binary perfect
phylogenies, without outputting them, to avoid inessential computation in the time measurement.
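The bit-vector representation can be illustrated as follows. This sketch uses Python integers in place of the C++ unsigned longs of our implementation, so it shows the idea rather than the actual code; the function names are ours.

```python
def to_mask(elements):
    """Encode a set of non-negative integers as a bit mask."""
    mask = 0
    for e in elements:
        mask |= 1 << e
    return mask

def sandwiched(lo, s, hi):
    """True if lo <= s <= hi as sets, using two bit operations each."""
    return (lo & ~s) == 0 and (s & ~hi) == 0

def is_laminar_masks(masks):
    """True if every pair of masks is nested or disjoint."""
    for i in range(len(masks)):
        for j in range(i + 1, len(masks)):
            common = masks[i] & masks[j]
            if common != 0 and common != masks[i] and common != masks[j]:
                return False
    return True
```

Intersection is a single bitwise AND, and the nested-or-disjoint test needs only comparisons of the result with the two operands.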



Table 1. The number of solved instances by B&B and ZDD out of 100 for each case.

            B&B                                       ZDD
(m, n)      (50,50)  (50,100)  (100,50)  (100,100)    (50,50)  (50,100)  (100,50)  (100,100)
p = 0.1     52       17                               99       99        93        90
p = 0.2                                               57       33        6         4



For the implementation of ZDD, we have used the library BDD+ developed by Minato. Among the
variables in $\{x_{i,e} \mid i \in \{1, \dots, m\}, e \in S\}$, those meeting condition 2 were removed
beforehand, since the outgoing 1-edge should point to the 0-terminal. Furthermore, the variables meeting
condition 1 were put at the tail of the linear order on all variables. Then, such a variable appears
only once as a label of a vertex, since the outgoing 0-edge should point to the 0-terminal. These were
implemented by combining Steps 0-2 in ZDD. This also affects Step 3: some variables can be further
removed, or further put at the tail of the linear order. We tried to find a complete linear order so
that the size of the constructed ZDD would be small. To this end, we introduced two heuristic methods.
The first one puts the variables in the same $S_i$ as close together as possible. Since these variables
depend heavily on one another, if we put them far apart, then the ZDD would need to store such
dependency at various locations. The second one puts the variables in $S_i$ and $S_j$ right in front of
those that were put at the tail, and performs the operations on them corresponding to condition 3 later
in the execution of the algorithm, if $S_i$ and $S_j$ meet more than one case in condition 3.

All programs were run on a machine with the following specification. OS: SUSE Linux Enterprise
Server 10 (x86_64); CPU: Quad-Core AMD Opteron(tm) Processor 8393 SE (#CPUs 16, #Processors 
32, Clock Freq. 3092MHz); Memory: 512GB. 

5.3 The Number of Solved Instances 

We have counted the number of instances that were solved by our implementation within two minutes
for $p = 0.1, 0.2$. Here, "solved" means that the algorithm successfully halts. Table 1 shows the
result. As we can see from the table, B&B was not able to solve most of the instances, even though they
are small. On the other hand, ZDD was able to solve almost all instances when $p = 0.1$. However, when
$p = 0.2$, the number of solved instances rapidly decreases.

Fig. 3 shows the accumulated number of solved instances by ZDD. Note that the horizontal axis
is in log-scale. For $(m, n, p) = (50, 50, 0.1)$, ZDD solved each of the 99 instances within one second.
For $(m, n, p) = (50, 100, 0.1)$, it solved each of the 99 instances within five seconds. This shows
the high effectiveness of the algorithm ZDD.

5.4 The Running Time of ZDD and the Size of ZDDs. 

Fig. 4 shows a scatter plot in which each point represents an instance solved by ZDD for $p = 0.1, 0.2$
with the running time (the horizontal coordinate) and the size of the ZDD constructed by ZDD (the
vertical coordinate). Note that this is a log-log plot. We can see a tendency that the algorithm spends
more time for instances with larger ZDDs. A simple $\ell_2$-regression reveals that the time spent
depends on the size almost linearly.

5.5 The Number of Perfect Phylogenies and the Size of ZDDs. 

Fig. 5 shows a log-log scatter plot in which each point represents an instance solved by ZDD for
$p = 0.1, 0.2$ with the number of perfect phylogenies (the horizontal coordinate) and the size of the
ZDD constructed by ZDD (the vertical coordinate). The plot exhibits the high compression rate of ZDDs.
We define the logarithmic compression ratio of a ZDD as the logarithm (with base 10) of the size of the
ZDD divided by the number of perfect phylogenies. Table 2 presents the means and the standard deviations
of the logarithmic compression ratio of the instances solved by ZDD, categorized by the choice of
parameters. It shows the high-rate compression by ZDDs, and for larger values of the parameters the
compression becomes stronger (the ratios become more negative). Among the solved instances, the
logarithmic compression ratios range from $-17.77$ to $-1.82$. Namely, for the most extreme case, the
size of the ZDD is approximately $10^{17.77}$ times smaller than the number of perfect phylogenies.
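The logarithmic compression ratio is straightforward to compute. The sketch below takes the difference of two logarithms so that very large counts, which can exceed the range of floating-point numbers when divided directly, are handled safely; the function names are ours.

```python
import math

def log_compression_ratio(zdd_size, num_phylogenies):
    """log10(zdd_size / num_phylogenies), as a difference of logarithms."""
    return math.log10(zdd_size) - math.log10(num_phylogenies)

def shrink_factor(ratio):
    """How many times smaller the ZDD is than the number of solutions."""
    return 10.0 ** (-ratio)
```

A ratio of $-2$, for example, means that the ZDD has one hundredth as many vertices as there are perfect phylogenies.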

Table 2. The means and the standard deviations of logarithmic compression ratios.

                     p = 0.1                                  p = 0.2
(m, n)               (50,50)  (50,100)  (100,50)  (100,100)   (50,50)  (50,100)  (100,50)  (100,100)
mean                 -4.13    -7.25     -5.62     -10.00      -8.06    -13.61    -9.24     -14.04
standard deviation   1.22     1.35      1.74      1.79        1.48     2.04      1.86      1.02



5.6 The Number of Solutions Found by B&B 

Unlike ZDD, the algorithm B&B can output some directed binary perfect phylogenies even if the
execution is interrupted. Fig. [5] shows the averages of the logarithm of the numbers of directed binary
perfect phylogenies (together with standard deviations) found by B&B within two minutes for each
case: four groups correspond to $(m, n) = (50, 50), (50, 100), (100, 50), (100, 100)$ from left to
right, and in each group there are five bars corresponding to $p = 0.1, 0.2, 0.3, 0.4, 0.5$ from left to
right. When $(m, n, p) = (50, 50, 0.1)$, the standard deviation is high since about a half of the
instances were solved within two minutes. Even for the seemingly difficult case
$(m, n, p) = (100, 100, 0.5)$, B&B was able to
find around 10^'* perfect phylogenies. This suggests that B&B can be useful even if ZDD does not finish 
the computation. 



5.7 The Number of Solutions Found by ZDD and B&B. 

Fig. 7 is a scatter plot in which each point represents an instance solved by ZDD with the number of
directed binary perfect phylogenies found by B&B within two minutes (the horizontal coordinate) and
the number of directed binary perfect phylogenies in the instance (the vertical coordinate). This shows
the percentage of the directed binary perfect phylogenies that were found by B&B. Since this is a
log-log plot, we can see that this percentage is quite low. There is one instance for
$(m, n, p) = (100, 50, 0.2)$ with 49,614,003,829,608,756,019,200 perfect phylogenies for which B&B could
only find 991,232. Thus the percentage is around $10^{-15}$ %. This really shows the power of ZDDs.

Fig. 4. The size of ZDDs and the running time of ZDD.



5.8 Running Time of ZDD and the Number of Solutions 

Fig. [5] shows a scatter plot in which each point represents an instance solved by ZDD for
$p = 0.1, 0.2$ with the running time (the horizontal coordinate) and the number of directed binary
perfect phylogenies in the instance (the vertical coordinate). Note that this is a log-log plot. There
is a weak tendency that the
algorithm spends more time for instances with more directed binary perfect phylogenies. We can see that 
the algorithm is able to solve an instance with more than 10^^ perfect phylogenies within one second. 



5.9 The Size of ZDDs During the Execution of ZDD. 



Fig. ini traces the size of ZDDs which are created as intermediate results during the execution of (the 
original version of) the algorithm ZDD. In the plot, there are two curves, each of which corresponds to 
a different instance for $(m, n, p) = (50, 50, 0.2)$. We have measured the size after each execution of
Step 3 in the algorithm. Step 3 is iterated once for each pair of distinct integers in
$\{1, \dots, m\}$, that is, $\binom{50}{2} = 1{,}225$ times. Therefore, the horizontal coordinates in
the plot range from 0 to 1,224, and the $i$-th iteration gives a point at $i - 1$ in the horizontal
coordinate. The vertical coordinate corresponds to the
size of the ZDD. Notice that this is a semi-log plot. 

For the red instance, the algorithm (with heuristic improvements) took 2.16 seconds, and for 
the green instance, it took 108.71 seconds. In this sense, the green instance is harder than 
the red one. As we can see from the figure, the size of the ZDD changes non-monotonically over time. 
For the red instance, the size of the final result is 25,414, while the maximum size during the execution is 
26,174; the ratio is 1.03. On the other hand, for the green instance, the size of the final result is 144,100, 
while the maximum size during the execution is 271,037; the ratio is 1.88. 
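
The iteration count and the two peak-to-final ratios quoted above can be rechecked with a few lines of Python. This is only an arithmetic sanity check; the helper name `peak_ratio` is ours and does not appear in the paper.

```python
from math import comb


def peak_ratio(final_size: int, max_size: int) -> float:
    """Ratio of the largest intermediate ZDD size to the final ZDD size."""
    return round(max_size / final_size, 2)


# Step 3 runs once per pair of distinct integers in {1, ..., n} with n = 50.
iterations = comb(50, 2)

red_ratio = peak_ratio(25_414, 26_174)      # red instance
green_ratio = peak_ratio(144_100, 271_037)  # green instance
```

The rounded ratios agree with the values 1.03 and 1.88 reported in the text.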
6 Conclusion 

We have presented the algorithm ZDD to enumerate all directed binary perfect phylogenies from incomplete 
data, and compared it with the algorithm B&B, which is based on a simple branch-and-bound idea. Theoretically, 
B&B runs in polynomial time per output (in the amortized sense), but ZDD has no such guarantee. In 
experiments, ZDD solved more instances than B&B. This shows a gap between theory and practice, and it 
is desirable to have a theoretical justification of why ZDD can outperform B&B. We have exhibited an 
example for which the compression by a ZDD is provably effective. However, that example was artificial. 
The experiments also show that ZDD can compress very well on random instances. It is desirable to obtain 
more natural theoretical evidence of why such good compression is achieved. 

The approach by ZDDs looks quite promising, and there are likely more problems in bioinformatics 
that could benefit from it. 
A Appendix: Details for the Branch-and-Bound Enumeration Algorithm 

In our branch-and-bound algorithm, at every node of a search tree, we make a decision whether a specified 
element $e$ of $S$ is contained in $S_j$ for a specified index $j$. The following observation is easy to obtain. 
Lemma 1. Let $S$ be a finite set, $\mathcal{L} = (L_1, \ldots, L_m)$ and $\mathcal{U} = (U_1, \ldots, U_m)$ be sequences of $m$ subsets 
of $S$ such that $L_i \subseteq U_i \subseteq S$ for all $i \in \{1, \ldots, m\}$, and $\mathcal{S} = (S_1, \ldots, S_m)$ be a directed binary perfect 
phylogeny for $\mathcal{L}$ and $\mathcal{U}$. 

1. If $L_i = U_i$ for all $i \in \{1, \ldots, m\}$, then $\mathcal{S}$ is the unique directed binary perfect phylogeny for $\mathcal{L}$ and $\mathcal{U}$. 

2. If $e \in S_j \setminus L_j$ for some $j$, then $\mathcal{S}$ is a directed binary perfect phylogeny for $\mathcal{L}'$ and $\mathcal{U}$, where 
$\mathcal{L}' = (L'_1, \ldots, L'_m)$ is defined as $L'_i = L_i$ for $i \neq j$ and $L'_j = L_j \cup \{e\}$. 

3. If $e \in U_j \setminus S_j$ for some $j$, then $\mathcal{S}$ is a directed binary perfect phylogeny for $\mathcal{L}$ and $\mathcal{U}'$, where 
$\mathcal{U}' = (U'_1, \ldots, U'_m)$ is defined as $U'_i = U_i$ for $i \neq j$ and $U'_j = U_j \setminus \{e\}$. □ 

Lemma 1 suggests the following algorithm. Step 1 is the bounding step, and Step 3 is the branching 
step. 

Algorithm: B&B($S, \mathcal{L}, \mathcal{U}$) 

Precondition: $S$ is a finite set, $\mathcal{L} = (L_1, \ldots, L_m)$, $\mathcal{U} = (U_1, \ldots, U_m)$, each member of $\mathcal{L}$ and $\mathcal{U}$ is a 
subset of $S$, and $L_i \subseteq U_i$ for every $i \in \{1, \ldots, m\}$. 
Postcondition: Output all the directed binary perfect phylogenies for $(S, \mathcal{L}, \mathcal{U})$. 

Step 1: If there exists no directed binary perfect phylogeny for $\mathcal{L}$ and $\mathcal{U}$, then output nothing and halt. 
Step 2: Otherwise, if $L_i = U_i$ for all $i \in \{1, \ldots, m\}$, then set $S_i = L_i$ for all $i \in \{1, \ldots, m\}$, output 
$(S_1, \ldots, S_m)$ and halt. 
Step 3: Otherwise, let $j \in \{1, \ldots, m\}$ be an arbitrary index such that $L_j \neq U_j$. Choose an arbitrary 
element $e \in U_j \setminus L_j$. 
Step 3-1: Let $\mathcal{L}' := (L'_1, \ldots, L'_m)$ be defined as $L'_i = L_i$ for all $i \neq j$, and $L'_j = L_j \cup \{e\}$. Then, run 
B&B($S, \mathcal{L}', \mathcal{U}$). 
Step 3-2: Let $\mathcal{U}' := (U'_1, \ldots, U'_m)$ be defined as $U'_i = U_i$ for all $i \neq j$, and $U'_j = U_j \setminus \{e\}$. Then, run 
B&B($S, \mathcal{L}, \mathcal{U}'$). 
Step 4: Halt. 
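
The steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. It assumes the standard characterization that (S_1, ..., S_m) is a directed binary perfect phylogeny exactly when L_i ⊆ S_i ⊆ U_i for all i and the sets form a laminar family (any two are disjoint or nested); this definition is not restated in this appendix. The names `is_laminar`, `feasible`, and `branch_and_bound` are hypothetical, and `feasible` is an exponential-time brute-force stand-in for the Step 1 test, so the running-time bound discussed in the text does not apply to this sketch.

```python
from itertools import combinations, product


def is_laminar(sets):
    """True if every pair of sets is disjoint or nested (assumed definition)."""
    return all(not (a & b) or a <= b or b <= a
               for a, b in combinations(sets, 2))


def feasible(lower, upper):
    """Brute-force stand-in for Step 1; only suitable for toy inputs."""
    choices = []
    for l, u in zip(lower, upper):
        free = sorted(u - l)
        choices.append([l | set(extra)
                        for r in range(len(free) + 1)
                        for extra in combinations(free, r)])
    return any(is_laminar(candidate) for candidate in product(*choices))


def branch_and_bound(lower, upper):
    """Yield every laminar (S_1, ..., S_m) with L_i <= S_i <= U_i."""
    if not feasible(lower, upper):                 # Step 1: bounding
        return
    j = next((i for i, (l, u) in enumerate(zip(lower, upper)) if l != u), None)
    if j is None:                                  # Step 2: L_i = U_i for all i
        yield tuple(frozenset(l) for l in lower)
        return
    e = min(upper[j] - lower[j])                   # Step 3: any undecided element
    # Step 3-1: force e into S_j by enlarging L_j.
    yield from branch_and_bound(
        lower[:j] + [lower[j] | {e}] + lower[j + 1:], upper)
    # Step 3-2: exclude e from S_j by shrinking U_j.
    yield from branch_and_bound(
        lower, upper[:j] + [upper[j] - {e}] + upper[j + 1:])


# Toy instance on S = {1, 2, 3} with two characters.
solutions = list(branch_and_bound([{1}, {2}], [{1, 2}, {2, 3}]))
```

Because the two branches at Step 3 disagree on whether e belongs to S_j, no solution is produced twice.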
Fig. 8. The number of directed binary perfect phylogenies in the instances solved by ZDD for each case. 



At Step 1, we may use any algorithm to check whether an instance $(S, \mathcal{L}, \mathcal{U})$ admits a directed binary 
perfect phylogeny, e.g., the one by Pe'er et al. [12]. Their algorithm actually outputs a directed binary perfect 
phylogeny $\mathcal{S} = (S_1, \ldots, S_m)$ for $(S, \mathcal{L}, \mathcal{U})$ if one exists. This $\mathcal{S}$ can be used as further information, for 
example at Step 3 of Algorithm B&B. We choose $e \in U_j \setminus L_j$ there. We have two cases. Recall that 
$L_j \subseteq S_j \subseteq U_j$ (by definition) and $L_j \neq U_j$ (by Step 2). 

1. If $e \in S_j \setminus L_j$, then in the call B&B($S, \mathcal{L}', \mathcal{U}$) at Step 3-1 we do not have to perform Step 1 since $\mathcal{S}$ 
is a directed binary perfect phylogeny for $(S, \mathcal{L}', \mathcal{U})$. 

2. If $e \in U_j \setminus S_j$, then in the call B&B($S, \mathcal{L}, \mathcal{U}'$) at Step 3-2 we do not have to perform Step 1 since $\mathcal{S}$ 
is a directed binary perfect phylogeny for $(S, \mathcal{L}, \mathcal{U}')$. 
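
The two cases above can be illustrated with a small Python sketch in which the Step 1 oracle returns a witness and the child call that still admits this witness skips the oracle. This is a hedged illustration under the same assumption as before (a solution is a laminar family with L_i ⊆ S_i ⊆ U_i); the names `find_witness` and `enumerate_with_witness` and the `stats` counter are hypothetical, and the brute-force oracle merely stands in for the algorithm of Pe'er et al.

```python
from itertools import combinations, product


def is_laminar(sets):
    """True if every pair of sets is disjoint or nested (assumed definition)."""
    return all(not (a & b) or a <= b or b <= a
               for a, b in combinations(sets, 2))


def find_witness(lower, upper):
    """Brute-force stand-in for Step 1: return one solution or None."""
    options = []
    for l, u in zip(lower, upper):
        free = sorted(u - l)
        options.append([l | set(c)
                        for r in range(len(free) + 1)
                        for c in combinations(free, r)])
    for candidate in product(*options):
        if is_laminar(candidate):
            return list(candidate)
    return None


def enumerate_with_witness(lower, upper, witness=None, stats=None):
    """B&B variant that reuses a known solution instead of re-running Step 1."""
    if stats is None:
        stats = {"oracle_calls": 0}
    if witness is None:                      # Step 1 is needed only without a witness
        stats["oracle_calls"] += 1
        witness = find_witness(lower, upper)
        if witness is None:
            return
    j = next((i for i, (l, u) in enumerate(zip(lower, upper)) if l != u), None)
    if j is None:                            # Step 2
        yield tuple(frozenset(l) for l in lower)
        return
    e = min(upper[j] - lower[j])             # Step 3
    with_e = lower[:j] + [lower[j] | {e}] + lower[j + 1:]
    without_e = upper[:j] + [upper[j] - {e}] + upper[j + 1:]
    in_witness = e in witness[j]
    # Case 1: e in S_j, so the witness stays valid for Step 3-1.
    yield from enumerate_with_witness(
        with_e, upper, witness if in_witness else None, stats)
    # Case 2: e not in S_j, so the witness stays valid for Step 3-2.
    yield from enumerate_with_witness(
        lower, without_e, None if in_witness else witness, stats)


stats = {"oracle_calls": 0}
solutions = list(enumerate_with_witness([{1}, {2}], [{1, 2}, {2, 3}], stats=stats))
```

At every branching node exactly one of the two children inherits the witness, so the oracle is invoked at most once per branching step rather than twice.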

The correctness of the algorithm is immediate. We now bound the running time. The relevant 
parameters are $m$, $n = |S|$, $k = \sum_{i=1}^{m} |U_i \setminus L_i|$, and the number $h$ of output directed binary perfect 
phylogenies. Let $t(m, n, k)$ be the worst-case time complexity of the algorithm that we use for Step 1. 
Also, let $T(m, n, k, h)$ be the worst-case time complexity of the execution of B&B($S, \mathcal{L}, \mathcal{U}$) with these 
parameters. If $k = 0$, then $T(m, n, k, h) = O(mn)$ since Step 2 already takes $O(mn)$ time. If $h = 0$, then 
$T(m, n, k, h) = O(mn) + t(m, n, k)$. Otherwise, 

$T(m, n, k, h) \le T(m, n, k-1, h_1) + T(m, n, k-1, h_2) + O(mn) + t(m, n, k)$, 

where $h = h_1 + h_2$. This leads to $T(m, n, k, h) \le O(kh(mn + t(m, n, k)))$. 
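
One way to see how the recurrence yields this bound is to count the nodes of the recursion tree. The sketch below is our reading, not a proof taken from the paper. It assumes that $t(m, n, k)$ is nondecreasing in $k$ and calls a node live if its instance admits at least one solution and dead otherwise. Live nodes at the same depth have pairwise disjoint solution sets, so there are at most $h$ of them per depth, and the depth never exceeds $k$. A dead node halts at Step 1 and is the child of a live branching node, which has at most one dead child because $h_1 + h_2 \ge 1$.

```latex
\begin{align*}
\#\{\text{live nodes}\} &\le (k+1)\,h, \\
\#\{\text{dead nodes}\} &\le \#\{\text{live branching nodes}\} \le k\,h, \\
T(m,n,k,h) &\le (2k+1)\,h \cdot \bigl(O(mn) + t(m,n,k)\bigr)
            = O\bigl(kh\,(mn + t(m,n,k))\bigr) \qquad (k \ge 1,\ h \ge 1).
\end{align*}
```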

If we use the algorithm by Pe'er et al. [12], which runs in $\tilde{O}(mn)$ time,¹ at Step 1, then we obtain 
the following theorem. 

Theorem 2. The execution B&B($S, \mathcal{L}, \mathcal{U}$) correctly outputs all the directed binary perfect phylogenies 
for $(S, \mathcal{L}, \mathcal{U})$ without duplication in $\tilde{O}(mnkh)$ time, where $m$ is the length of the sequences $\mathcal{L}, \mathcal{U}$, 
$n = |S|$, $k = \sum_{i=1}^{m} |U_i \setminus L_i|$, and $h$ is the number of output directed binary perfect phylogenies. In particular, 
each directed binary perfect phylogeny can be found in polynomial time (in the input size) per output, in 
the amortized sense. □ 

¹ The $\tilde{O}$-notation suppresses the polylogarithmic factor. 

For the experiment in Section 5, we use the deterministic version of Algorithm A in the paper by Pe'er 
et al. [12, p. 598] as a subroutine in Step 1, but we have simplified it to gain practical performance. 






